library(tidyverse)
library(lubridate)
library(patchwork)
library(forcats)
library(ozmaps)
library(sf)
library(cartogram)
library(ggthemes)
library(broom)
library(readxl)
library(nullabor)
library(plotly)
library(sugarbag)
In Melbourne we have been in a strict lockdown since July. Each week we get our hopes up that restrictions might be eased, and once again these hopes are dashed by the announcement Sunday Oct 25, keeping the restrictions a little longer because of another outbreak in the northwest of the city. The data we have collected here are the case counts by Victorian local government area (LGA) since the beginning of July. We will examine the spatiotemporal distribution of these counts.
Working with spatial data is always painful! It almost always requires some ugly code. Part of the reason for the difficulty is the use of special data objects, that describe maps. There are several different choices, and some packages and tools use one, and others use another, so not all tools work together. The sf package is a recent endeavour that helps enormously, but some tools still use other forms, and when you run into errors this might be the reason - it can be hard to tell. Another reason is that map objects can be very large, which makes sense for accurate mapping, but for data analysis and visualisation, we’d rather have smaller, even if slightly inaccurate, spatial objects. It can be helpful to thin out map data before doing further analysis - you need special tools for this, eg mapshapr. We don’t need this for the exercises here. Another problem commonly encountered is that there are numerous coordinate systems, and types of projections of the 3D globe into a 2D canvas. We have become accustomed to lat/long but like time its an awkward scale to compute on because a translation from E/W and N/S to positive and negative values is needed. More commonly a Universal Transverse Mercator (UTM) is the standard but its far less intuitive to use.
The code for all the analysis is provided for you. We recommend that you run the code in steps to see what it is doing, why the mutating and text manipulations are necessary. Talk about the code with each other to help you understand it.
melb_lga_covid.csv contains the cases by LGA. Read the data in and inspect result. You should find that some variables are type chr because “null” has been used to code entries on some days. This needs fixing, and also missings should be converted to 0. Why does it make sense to substitute missings with 0, here?NAs really have to be 0s. Its likely that the cells were left blank when numbers were recorded, left blank because there were no cases that day.
# Read the data
# Replace null with 0, for three LGAs
# Convert to long form to join with polygons
# Make the date variables a proper date
# Set NAs to 0, this is a reasonable assumption
covid <- read_csv("https://raw.githubusercontent.com/numbats/eda/master/data/melb_lga_covid.csv") %>%
mutate(Buloke = as.numeric(ifelse(Buloke == "null", "0", Buloke))) %>%
mutate(Hindmarsh = as.numeric(ifelse(Hindmarsh == "null", "0", Hindmarsh))) %>%
mutate(Towong = as.numeric(ifelse(Towong == "null", "0", Towong))) %>%
pivot_longer(cols = Alpine:Yarriambiack, names_to="NAME", values_to="cases") %>%
mutate(Date = ydm(paste0("2020/",Date))) %>%
mutate(cases=replace_na(cases, 0))
This is cumulative data. Once re-calculated as daily counts, and artifact that emerges is that ther are a few small negatives. this could occur if the previous days numbers have been adjusted as new information on cases or duplicates were found.
# Check the case counts
covid %>% filter(NAME == "Brimbank") %>%
ggplot(aes(x=Date, y=cases)) +
geom_point()
# Case counts are cumulative, so take lags to get daily case counts
covid <- covid %>%
group_by(NAME) %>%
mutate(new_cases = cases - dplyr::lag(cases))
# Check the case counts
covid %>% filter(NAME == "Brimbank") %>%
ggplot(aes(x=Date, y=new_cases)) +
geom_col()
ozmaps package. We need to fix some names of LGAs because there are duplicated LGA names, and there is one mismatch in names from the COVID data and the ozmaps data (Colac Otway). If the COVID data had been provided with a unique LGA code it would have helped in merging with the polygon data.# Read the LGA data from ozmaps package.
# This has LGAs for all of Australia.
# Need to filter out Victoria LGAs, avoiding LGAs
# from other states with same name, and make the names
# match covid data names. The regex equation is
# removing () state and LGA type text strings
# Good reference: https://r-spatial.github.io/sf/articles/sf1.html
data("abs_lga")
vic_lga <- abs_lga %>%
mutate(NAME = ifelse(NAME == "Latrobe (M) (Tas.)", "LatrobeM", NAME)) %>%
mutate(NAME = ifelse(NAME == "Kingston (DC) (SA)", "KingstonSA", NAME)) %>%
mutate(NAME = ifelse(NAME == "Bayside (A)", "BaysideA", NAME)) %>%
mutate(NAME = str_replace(NAME, " \\(.+\\)", "")) %>%
mutate(NAME = ifelse(NAME == "Colac-Otway", "Colac Otway", NAME))
vic_lga <- st_transform(vic_lga, 3395)
# 3395 is EPSG CRS, equiv to WGS84 mercator,
# see https://spatialreference.org/ref/epsg/?page=28
# cartogram() needs this to be set
sf object so the geom_sf will automatically grab the geometry from the object to make the spatial polygons.The high count LGAs are all in Melbourne, mostly in the western suburbs.
# Select a day when cases were high
covid_08_01 <- covid %>%
filter(Date == ymd("2020-08-01"))
# Join covid data to polygon data, remove LGAs with
# missing values which should leave just Vic LGAs
vic_lga_covid <- vic_lga %>%
left_join(covid_08_01, by="NAME") %>%
filter(!is.na(cases))
# Make choropleth map, with appropriate colour palette
ggplot(vic_lga_covid) +
geom_sf(aes(fill = new_cases, label=NAME), colour="white") +
scale_fill_distiller("Cases", palette = "YlOrRd",
direction=1) +
theme_map() +
theme(legend.position="bottom")
# Make it interactive
# plotly::ggplotly()
VIF2019_Population_Service_Ages_LGA_2036.xlsx has been extracted from the Vic Gov web site. It is a complicated xlsx file, with the data in sheet 3, and starting 13 rows down. The readxl package is handy here to extract the population data needed. You’ll need to join the population counts to the map data to make a cartogram. Once you have the transformed polygon data, the same plotting code can be used, as created the choropleth map.Interestingly, the population of the LGAs is quite differnt, with densely populated LGAs in Melbourne. These get greatly enlarged by the algorithm, and LGA polygons from the rural areas are much smaller. It makes it easier to see the LGAs with high case counts, and also all of the LGAs in the city with low counts. (Note: the white inner city polygons are not actually LGAs, just unfortunate artifacts of the cartogram transformation.
# Incorporate population data to make cartogram
# Population from https://www.planning.vic.gov.au/land-use-and-population-research/victoria-in-future/tab-pages/victoria-in-future-data-tables
# Data can be downloaded from https://github.com/numbats/eda/blob/master/data/VIF2019_Population_Service_Ages_LGA_2036.xlsx
pop <- read_xlsx(here::here("data/VIF2019_Population_Service_Ages_LGA_2036.xlsx"), sheet=3, skip=13, col_names = FALSE) %>%
rename(NAME = `...4`, pop=`...22`) %>%
filter(NAME != "Unincorporated Vic") %>%
mutate(NAME = str_replace(NAME, " \\(.+\\)", "")) %>%
mutate(NAME = ifelse(NAME == "Colac-Otway", "Colac Otway", NAME))
vic_lga_covid <- vic_lga_covid %>%
left_join(pop, by="NAME")
# Compute additional statistics
vic_lga_covid <- vic_lga_covid %>%
group_by(NAME) %>%
mutate(cases_per10k = max(new_cases/pop*10000, 0),
lnew_cases = log10(new_cases - min(new_cases) + 1)) %>%
ungroup()
# Make a contiguous cartogram
vic_lga_covid_carto <- cartogram_cont(vic_lga_covid, "pop")
# This st_cast() is necessary to get plotly to work
vic_lga_covid_carto <- st_cast(vic_lga_covid_carto, "MULTIPOLYGON")
ggplot(vic_lga_covid_carto) +
geom_sf(aes(fill = cases, label=NAME), colour="white") +
scale_fill_distiller("Cases", palette = "YlOrRd",
direction=1) +
theme_map() +
theme(legend.position="bottom")
# ggplotly()
The temporal trend is easier to read from the cartograms. The same LGAs are primarily affected throughout the time. It started in the northwest, in week 27, peaking in wekk 32. In week 29 it spread to a couple of eastern LGAs. Numbers have been low in all LGAs for the past four weeks. (Note: it could be interesting to use a log scale for counts, and also examine count for 10k people instead of raw counts. On the log scale Melbourne appears to be much more of a hotspot overall.)
# Aggregate counts to weekly, to examine temporal trend,
# and re-compute cases per week
covid_week <- covid %>%
mutate(week = week(Date)) %>%
group_by(NAME, week) %>%
summarise(wk_cases = sum(new_cases, na.rm=TRUE)) %>%
ungroup() %>%
left_join(pop, by="NAME") %>%
group_by(NAME, week) %>%
mutate(wk_cases_per10k = max(wk_cases/pop*10000, 0)) %>%
ungroup()
# Need to join geometry on again
vic_lga_covid_week <- vic_lga %>%
left_join(covid_week, by="NAME") %>%
filter(!is.na(wk_cases))
# Draw the faceted map
ggplot(vic_lga_covid_week) +
geom_sf(aes(fill = wk_cases), colour="white") +
scale_fill_distiller("Cases", palette = "YlOrRd",
direction=1) +
facet_wrap(~week, ncol=4) +
theme_map() +
theme(legend.position="bottom")
# Join to the cartogram
vic_lga_covid_week_carto <-
vic_lga_covid_carto %>%
left_join(covid_week, by="NAME")
# Make the facetted cartogram
ggplot(vic_lga_covid_week_carto) +
geom_sf(aes(fill = wk_cases), colour="white") +
scale_fill_distiller("Cases", palette = "YlOrRd",
direction=1) +
facet_wrap(~week, ncol=4) +
theme_map() +
theme(legend.position="bottom")
The spatiotemporal trend changes a little. A specific hotspot in one LGA is more visible in week 28, and several other hotspots appear, including Colac-Otway which had an outbreak in week 31 in a lamb processing facility. The cases per 10k is the better statistic to use, for all the plots, because it makes the values comparable and independent of the population differences between LGAs.
# Make the facetted cartogram
ggplot(vic_lga_covid_week_carto) +
geom_sf(aes(fill = wk_cases_per10k), colour="white") +
scale_fill_distiller("Cases", palette = "YlOrRd",
direction=1) +
facet_wrap(~week, ncol=4) +
theme_map() +
theme(legend.position="bottom")
PURELY OUT OF INTEREST, EXTRA CODE TO MAKE A HEXMAP OF VICTORIA IS PROVIDED
This code uses the sugarbag package, which is currently only available on GitHub.
# Spatial coordinates need to be in long/lat
vlc_latlong <- st_transform(vic_lga_covid, crs = "+proj=longlat +datum=WGS84")
# Placement of hexmaps depends on position relative to
# Melbourne central
data(capital_cities)
vic_lga_hexmap <- create_hexmap(
shp = vlc_latlong,
sf_id = "NAME",
focal_points = capital_cities, verbose = TRUE)
# This shows the centroids of the hexagons
# ggplot(vic_lga_hexmap, aes(x=hex_long, y=hex_lat)) +
# geom_point()
# Hexagons are made with the `fortify_hexagon` function
vic_lga_covid_hexmap <- vic_lga_hexmap %>%
fortify_hexagon(sf_id = "NAME", hex_size = 0.1869) %>%
left_join(covid_08_01, by="NAME") %>%
filter(!is.na(cases))
ggplot() +
geom_sf(data=vlc_latlong,
fill = "grey90", colour = "white", size=0.1) +
geom_polygon(data=vic_lga_covid_hexmap,
aes(x=long, y=lat, group=hex_id,
fill = new_cases,
colour = new_cases), size=0.2) +
scale_fill_distiller("Cases", palette = "YlOrRd",
direction=1) +
scale_colour_distiller("Cases", palette = "YlOrRd",
direction=1) +
theme_map() +
theme(legend.position="bottom")
# Now join to the weekly data to make faceted hexmaps
vic_lga_covid_week_hexmap <- vic_lga_hexmap %>%
fortify_hexagon(sf_id = "NAME", hex_size = 0.1869) %>%
full_join(covid_week, by="NAME")
ggplot() +
geom_sf(data=vlc_latlong,
fill = "grey90", colour = "white", size=0.1) +
geom_polygon(data=vic_lga_covid_week_hexmap,
aes(x=long, y=lat, group=hex_id,
fill = wk_cases,
colour = wk_cases), size=0.1) +
scale_fill_distiller("Cases", palette = "YlOrRd",
direction=1) +
scale_colour_distiller("Cases", palette = "YlOrRd",
direction=1) +
facet_wrap(~week, ncol=4) +
theme_map() +
theme(legend.position="bottom")
skittle <- read_csv("https://raw.githubusercontent.com/njtierney/skittles/master/data/skittles.csv")
Each person tasted 10 skittles.
skittle %>%
with(table(person))
## person
## a b c
## 10 10 10
If a person cannot distinguish flavours then they will randomly choose one of the five flavours. So the probability that they select the correct flavour is 1/5.
Suppose \(X\) is the number of skittles that they correctly identified the flavour. Then assuming that the person cannot distinguish flavours and order of tasting the skittles does not matter, \(X \sim B(10, 0.2)\). Then \(P(X = 2) = {10 \choose 2} 0.2^5 0.8^5\approx 0.3\). So there’s only about 30% chance such an event happens!
dbinom(2, 10, 0.2)
## [1] 0.3019899
Suppose \(X\) is the number of skittles that a person identified the flavour correctly out of 30 skittles. Suppose each tasting is independent and has a equal probability of identifying the flavour correctly; we denote this probability as \(p\). We test the hypotheses: \(H_0: p = 0.2\) vs. \(H_1: p > 0.2\). Under \(H_0\), \(X\sim B(30, 0.2)\) and therefore the \(p\)-value is \(P(X \geq 15) \approx 0.0002\). The \(p\)-value is small so the data supports that people can correctly identify the flavour of a skittle!
sum(skittle$correct)
## [1] 15
1 - pbinom(14, 30, 0.2)
## [1] 0.0002312256
To construct a test statistic, we need to construct a summary statistic with some known distribution under the null hypothesis (if using a parametric approach) with large (or extreme) values indicating rejection of the null hypothesis. Suppose that \(X_1\), \(X_2\) and \(X_3\) are the number of skittles out of 10 that person a, b and c, respectively, correctly identified. If each tasting is independent, then \(X_1 \sim B(10, p_1)\), \(X_2 \sim B(10, p_2)\) and \(X_3 \sim B(10, p_3)\) where \(p_i\) is the probability that the \(i\)-th person correctly identifies the flavour of a skittle. Now under \(H_0\) you may assume that \(p_1 = p_2 = p_3 = 0.2\) and assuming each person is independent, \(X_1 + X_2 + X_3 \sim B(30, 0.2)\). Same as (d)! However, if we know remove the assumption that each tasting is independent (so the order of tasting does matter), then the distribution of the test statistic does not hold true any longer.
gtile <- skittle %>%
ggplot(aes(factor(order), person, fill = factor(correct))) +
geom_tile(color = "black", size = 2) +
coord_equal() +
scale_fill_viridis_d() +
labs(x = "Order", y = "Person", fill = "Correct")
gtile
The null plot is constructed as follows.
set.seed(1)
method <- null_dist("correct", "binom", list(size = 1, prob = 0.2))
gtile %+% method(skittle)
nullabor or otherwise) of 20 plots. Ask your peers, family or friends which plot looks different.lineup_df <- lineup(method, true = skittle)
## decrypt("EnXL zNTN Z6 PJfZTZJ6 WY")
gtile %+% lineup_df +
facet_wrap(~.sample) +
guides(fill = FALSE) +
theme(axis.text = element_blank(),
axis.title = element_blank())
decrypt("bhMq KJPJ 62 sSQ6P6S2 7m")
## [1] "8XDl fopo nJ 6S4npnSJ NA"
We suppose that each person has the same ability to identify the data plot. If we let \(X\) be the number of people who correctly identified the data plot in the lineup, then \(X \sim B(100, p)\). The visual inference \(p\)-value is calculated from testing the hypotheses \(H_0: p = 0.05\) vs \(H_1: p \neq 0.05\), and so is \(P(X\geq 76) \approx 0\). The visual inference \(p\)-value is very small so there is strong evidence to believe that the structure in the data deviates away from the null distribution!
1 - pbinom(75, 100, 0.05)
## [1] 0
gbar <- skittle %>%
mutate(person = fct_reorder(person, correct, sum)) %>%
group_by(person) %>%
summarise(correct = sum(correct)) %>%
ggplot(aes(person, correct)) +
geom_col() +
labs(x = "Person", y = "Correct") +
geom_hline(yintercept = 2, linetype = "dashed")
gbar
gbar %+% lineup_df +
facet_wrap(~.sample) +
guides(fill = FALSE) +
theme(axis.text = element_blank(),
axis.title = element_blank())
decrypt("bhMq KJPJ 62 sSQ6P6S2 7m")
## [1] "8XDl fopo nJ 6S4npnSJ NA"
The estimated power of visual statistic in (f) is 76% and for the barplot is 26%. So the difference in power is 50%.